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Abstract 



We present a probabilistic generative model for timing deviations in expressive mu- 
sic performance. The structure of the proposed model is equivalent to a switching state 
space model. The switch variables correspond to discrete note locations as in a musical 
score. The continuous hidden variables denote the tempo. We formulate two well known 
music recognition problems, namely tempo tracking and automatic transcription (rhythm 
quantization) as filtering and maximum a posteriori (MAP) state estimation tasks. Ex- 
act computation of posterior features such as the MAP state is intractable in this model 
class, so we introduce Monte Carlo methods for integration and optimization. We compare 
Markov Chain Monte Carlo (MCMC) methods (such as Gibbs sampling, simulated anneal- 
ing and iterative improvement) and sequential Monte Carlo methods (particle filters). Our 
simulation results suggest better results with sequential methods. The methods can be 
applied in both online and batch scenarios such as tempo tracking and transcription and 
are thus potentially useful in a number of music applications such as adaptive automatic 
accompaniment, score typesetting and music information retrieval. 

1. Introduction 

Automatic music transcription refers to extraction of a human readable and interpretable 
description from a recording of a musical performance. Traditional music notation is such 
a description that lists the pitch levels (notes) and corresponding timestamps. 

Ideally, one would like to recover a score directly from the audio signal. Such a represen- 
tation of the surface structure of music would be very useful in music information retrieval 
(Music-IR) and content description of musical material in large audio databases. However, 
when operating on sampled audio data from polyphonic acoustical signals, extraction of a 
score-like description is a very challenging auditory scene analysis task (Vercoe, Gardner, 
& Scheirer, 1998). 

In this paper, we focus on a subproblem in music-ir, where we assume that exact timing 
information of notes is available, for example as a stream of MIDI 1 events from a digital 
keyboard. 

A model for tempo tracking and transcription from a MIDI-like music representation 
is useful in a broad spectrum of applications. One example is automatic score typesetting, 

1. Musical Instruments Digital Interface. A standard communication protocol especially designed for digital 
instruments such as keyboards. Each time a key is pressed, a MIDI keyboard generates a short message 
containing pitch and key velocity. A computer can tag each received message by a timestamp for real-time 
processing and/or recording into a file. 
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the musical analog of word processing. Almost all score typesetting applications provide a 
means of automatic generation of a conventional music notation from MIDI data. 

In conventional music notation, the onset time of each note is implicitly represented by 
the cumulative sum of durations of previous notes. Durations are encoded by simple rational 
numbers (e.g., quarter note, eighth note), consequently all events in music are placed on 
a discrete grid. So the basic task in MIDI transcription is to associate onset times with 
discrete grid locations, i.e., quantization. 

However, unless the music is performed with mechanical precision, identification of the 
correct association becomes difficult. This is due to the fact that musicians introduce 
intentional (and unintentional) deviations from a mechanical prescription. For example 
timing of events can be deliberately delayed or pushed. Moreover, the tempo can fluctuate 
by slowing down or accelerating. In fact, such deviations are natural aspects of expressive 
performance; in the absence of these, music tends to sound rather dull and mechanical. 
On the other hand, if these deviations are not accounted for during transcription, resulting 
scores have often very poor quality. 

Robust and fast quantization and tempo tracking is also an important requirement for 
interactive performance systems; applications that "listen" to a performer for generating an 
accompaniment or improvisation in real time (Raphael, 2001b; Thorn, 2000). At last, such 
models are also useful in musicology for systematic study and characterization of expressive 
timing by principled analysis of existing performance data. 

From a theoretical perspective, simultaneous quantization and tempo tracking is a 
"chicken-and-egg" problem: the quantization depends upon the intended tempo interpre- 
tation and the tempo interpretation depends upon the quantization. Apparently, human 
listeners can resolve this ambiguity (in most cases) without any effort. Even persons without 
any musical training are able to determine the beat and the tempo very rapidly. However, 
it is still unclear what precisely constitutes tempo and how it relates to the perception of 
the beat, rhythmical structure, pitch, style of music etc. Tempo is a perceptual construct 
and cannot directly be measured in a performance. 

The goal of understanding tempo perception has stimulated a significant body of re- 
search on the psychological and computational modeling aspects of tempo tracking and 
beat induction, e.g., see (Desain & Honing, 1994; Large Sz Jones, 1999; Toiviainen, 1999). 
These papers assume that events are presented as an onset list. Attempts are also made 
to deal directly with the audio signal (Goto Sz Muraoka, 1998; Scheirer, 1998; Dixon & 
Cambouropoulos, 2000). 

Another class of tempo tracking models are developed in the context of interactive 
performance systems and score following. These models make use of prior knowledge in the 
form of an annotated score (Dannenberg, 1984; Vercoe & Puckette, 1985). More recently, 
Raphael (2001b) has demonstrated an interactive real-time system that follows a solo player 
and schedules accompaniment events according to the player's tempo interpretation. 

Tempo tracking is crucial for quantization, since one can not uniquely quantize onsets 
without having an estimate of tempo and the beat. The converse, that quantization can 
help in identification of the correct tempo interpretation has already been noted by Desain 
and Honing (1991). Here, one defines correct tempo as the one that results in a simpler 
quantization. However, such a schema has never been fully implemented in practice due 
to computational complexity of obtaining a perceptually plausible quantization. Hence 
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quantization methods proposed in the literature either estimate the tempo using simple 
heuristics (Longuet-Higgins, 1987; Pressing Sz Lawrence, 1993; Agon, Assayag, Fineberg, 
<fc Rueda, 1994) or assume that the tempo is known or constant (Desain & Honing, 1991; 
Cambouropoulos, 2000; Hamanaka, Goto, Asoh, &: Otsu, 2001). 

Our approach to transcription and tempo tracking is from a probabilistic, i.e., Bayesian 
modeling perspective. In Cemgil et al. (2000), we introduced a probabilistic approach to 
perceptually realistic quantization. This work also assumed that the tempo was known or 
was estimated by an external procedure. For tempo tracking, we introduced a Kalman filter 
model (Cemgil, Kappen, Desain, & Honing, 2001). In this approach, we modeled the tempo 
as a smoothly varying hidden state variable of a stochastic dynamical system. 

In the current paper, we integrate quantization and tempo tracking. Basically, our 
model balances score complexity versus smoothness in tempo deviations. The correct tempo 
interpretation results in a simple quantization and the correct quantization results in a 
smooth tempo fluctuation. An essentially similar model is proposed recently also by Raphael 
(2001a). However, Raphael uses an inference technique that only applies for small models; 
namely when the continuous hidden state is one dimensional. This severely restricts the 
models one can consider. In the current paper, we survey general and widely used state-of- 
the-art techniques for inference. 

The outline of the paper is as follows: In Section 2, we propose a probabilistic model for 
timing deviations in expressive music performance. Given the model, we will define tempo 
tracking and quantization as inference of posterior quantities. It will turn out that our model 
is a switching state space model in which computation of exact probabilities becomes in- 
tractable. In Section 3, we will introduce approximation techniques based on simulation, 
namely Markov Chain Monte Carlo (MCMC) and sequential Monte Carlo (SMC) (Doucet, 
de Freitas, & Gordon, 2001; Andrieu, de Freitas, Doucet, & Jordan, 2002). Both approaches 
provide flexible and powerful inference methods that have been successfully applied in di- 
verse fields of applied sciences such as robotics (Fox, Burgard, & Thrun, 1999), aircraft 
tracking (Gordon, Salmond, &: Smith, 1993), computer vision (Isard &: Blake, 1996), econo- 
metrics (Tanizaki, 2001). Finally we will present simulation results and conclusions. 

2. Model 

Assume that a pianist is improvising and we are recording the exact onset times of each key 
she presses during the performance. We denote these observed onset times by yo,yi,y2 • • • 
Vk ■ ■ - Vk or more compactly by yo-.K- We neither have access to a musical notation of the 
piece nor know the initial tempo she has started her performance with. Moreover, the 
pianist is allowed to freely change the tempo or introduce expression. Given only onset 
time information yo-.K, we wish to find a score ji-.k and track her tempo fluctuations z^-.k- 
We will refine the meaning of 7 and z later. 

This problem is apparently ill-posed. If the pianist is allowed to change the tempo 
arbitrarily it is not possible to assign a "correct" score to a given performance. In other 
words any performance yo-.x can be represented by using a suitable combination of an 
arbitrary score with an arbitrary tempo trajectory. Fortunately, the Bayes theorem provides 
an elegant and principled guideline to formulate the problem. Given the onsets yo-.K, the 
best score ji-.k an d tempo trajectory z^-.k can be derived from the 'posterior distribution 
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that is given by 



P(ll:K,Z 0:K \yo:K) 



P(VO:k) 



P{yO:K\ll:K , Z0-.k)p(11:K , Z 0:K ) 



a quantity, that is proportional to the product of the likelihood term p(yo:K\7i:Ki zq-.k) and 
the prior term p(7i : K ? -^O:^)- 

In rhythm transcription and tempo tracking, the prior encodes our background knowl- 
edge about the nature of musical scores and tempo deviations. For example, we can con- 
struct a prior that prefers "simple" scores and smooth tempo variations. 

The likelihood term relates the tempo and the score to actual observed onset times. In 
this respect, the likelihood is a model for short time expressive timing deviations and motor 
errors that are introduced by the performer. 



Quantization Locations 




(m) (m) (m) 



Score 



•Tempo Trajectory 



•Noiseless onsets 



Observed Onsets 



Figure 1: Graphical Model. Square and oval nodes correspond to discrete and continuous 
variables respectively. In the text, we sometimes refer to the continuous hidden 
variables (t&, Ajt) by z^. The dependence between 7 and c is deterministic. All 
c, 7 , r and A are hidden; only onsets y are observed. 



2.1 Score prior 

To define a score ji-.k, we first introduce a sequence of quantization locations cq : k- A 
quantization location q. specifies the score time of the A;'th onset. We let 7^ denote the 
interval between quantization locations of two consecutive onsets 

Ik = Ck~ Cjfc-l (1) 

For example consider the conventional music notation j j~j which encodes the score 7^3 = 
[1 0.5 0.5]. Corresponding quantization locations are co:3 = [0 1 1.5 2]. 

One simple way of defining a prior distribution on quantization locations p(ck) is speci- 
fying a table of probabilities for c& mod 1 (the fraction of c&). For example if we wish to 
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allow for scores that have sixteenth notes and triplets, we define a table of probabilities for 
the states c mod 1 = {0, 0.25, 0.5, 0.75} U{0, 0.33, 0.67}. Technically, the resulting prior 
p(ck) is periodic and improper (since c\. are in principle unbounded so we can not normalize 
the distribution). 

However, if the number of states of c k mod 1 is large, it may be difficult to estimate the 
parameters of the prior reliably. For such situations we propose a "generic" prior as follows: 
We define the probability, that the A;'th onset gets quantized at location q., by p(c&) oc 
exp(— Ad(cfc)) where d{c k ) is the number of significant digits in the binary expansion of c& 
mod 1. For example d(l) = 0, c?(1.5) = 1, d(7 + 9/32) = 5 etc. The positive parameter A is 
used to penalize quantization locations that require more bits to be represented. Assuming 
that quantization locations of onsets are independent a-priori, (besides being increasing in 
k, i.e., c& > c&_i), the prior probability of a sequence of quantization locations is given by 
p{cq-.k) oc exp(— A Ylk=o ^( c k))- We further assume that cq 6 [0,1). One can check that 
such a prior prefers simpler notations, e.g., p{ Jlj=o ) < p{ J J~J )• We can generalize this 
prior to other subdivisions such triplets and quintiplets in Appendix A. 

Formally, given a distribution on cq-.k-, the prior of a score 71 is given by 

P{ll:K) = ^2p(11:k\c0:k)p(c0:k) (2) 

C0:K 

Since the relationship between cq : k and 71 : # is deterministic, p{^i:k\cq : k) is degenerate for 
any given cq : k, so we have 

/ K k \ 

p{1\-.k) oc exp I -A^c?(J^ 7A/ ) (3) 
V k=l k' = l J 

One might be tempted to specify a prior directly on 71. and get rid of cq : k entirely. 
However, with this simpler approach it is not easy to devise realistic priors. For example, 
consider a sequence of note durations [1 1/16 1 1 1 . . . ]. Assuming a factorized prior on 
7 that penalizes short note durations, this rhythm would have relatively high probability 
whereas it is quite uncommon in conventional music. 



2.2 Tempo prior 

We represent the tempo in terms of its inverse, i.e., the period, and denote it with A. For 
example a tempo of 120 beats per minute (bpm) corresponds to A = 60/120 = 0.5 seconds. 
At each onset the tempo changes by an unknown amount CA k - We assume the change ^ k 
is iid with A/"(0, Qa)- 2 We assume a first order Gauss-Markov process for the tempo 

A k = Afc.-i + CA, (4) 

Eq. 4 defines a distribution over tempo sequences Aq : k- Given a tempo sequence, the 
"ideal" or "intended" time of the next onset is given by 

Tk = T k -1 + lk^k-1 + Cr k (5) 

2. We denote a (scalar or multivariate) Gaussian distribution p(x) with mean vector [i and covariance 
matrix P by Af(p, P) = \2nP\~i exp(-±(x - p) T P _1 (x - fj,)). 
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The noise term Q Tk denotes the amount of accentuation (that is deliberately playing a 
note ahead or back in time) without causing the tempo to be changed. We assume ( Tk ~ 
Af(0, Q T ). Ideal onsets and actually observed "noisy" onsets are related by 

Vk = T k + e k (6) 

The noise term e& models small scale expressive deviations or motor errors in timing of indi- 
vidual notes. In this paper we will assume that has a Gaussian distribution parameterized 
by Af{0,R). 

The initial tempo distribution p(Aq) specifies a range of reasonable tempi and is given 
by a Gaussian with a broad variance. We assume an uninformative (flat) prior on to- The 
conditional independence structure is given by the graphical model in Figure 1. Table 1 
shows a possible realization from the model. 

We note that our model is a particular instance of the well known switching state space 
model (also known as conditionally linear dynamical system, jump Markov linear system, 
switching Kalman filter) (See, e.g., Bar-Shalom &: Li, 1993; Doucet &: Andrieu, 2001; 
Murphy, 2002). 



k 





1 


2 


3 


Ik 




} 


J 


J> ... 


Ck 





1/2 


3/2 


2 


A* 


0.5 


0.6 


0.7 




Tk 





0.25 


0.85 


1.20 ... 


Vk 





0.23 


0.88 


1.24 ... 



Table 1: A possible realization from the model: a ritardando. For clarity we assume £ T = 0. 

In the following sections, we will sometimes refer use Zk = (t&, A&) t and refer to zq : k 
as a tempo trajectory. Given this definition, we can compactly represent Eq. 4 and Eq. 5 by 

z k = ( I ™ + a (7) 

where Ck = (Ct^CaJ- 
2.3 Extensions 

There are several possible extensions to this basic parameterization. For example, one could 
represent the period A in the logarithmic scale. This warping ensures positivity and seems 
to be perceptually more plausible since it promotes equal relative changes in tempo rather 
than on an absolute scale (Grubb, 1998; Cemgil et al., 2001). Although the resulting model 
becomes nonlinear, it can be approximated fairly well by an extended Kalman filter (Bar- 
Shalom & Li, 1993). 

A simple random walk model for tempo fluctuations such as in Eq. 7 seems not to be 
very realistic. We would expect the tempo deviations to be more structured and smoother. 
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In our dynamical system framework such smooth deviations can be modeled by increasing 
the dimensionality of z to include higher order "inertia" variables (Cemgil et al., 2001). For 
example consider the following model, 



ft Ik Ik o 
10 




\ 



A 



Tk-1 

Ai,fc-i 

A2,fc-1 



(8) 



We choose this particular parameterization because we wish to interpret Ai as the slowly 
varying "average" tempo and A 2 as a temporary change in the tempo. Such a model is useful 
for situations where the performer fluctuates around an almost constant tempo; a random 
walk model is not sufficient in this case because it forgets the initial values. Additional state 
variables A3, . . . , Ab_i act like additional "memory" elements. By choosing the parameter 
matrix A and noise covariance matrix Q, one can model a rich range of temporal structures 
in expressive timing deviations. 

The score prior can be improved by using a richer model. For example to allow for 
different time signatures and alternative rhythmic subdivisions, one can introduce additional 
hidden variables (See Cemgil et al. (2000) or Appendix A) or use a Markov chain (Raphael, 
2001a). Potentially, such extensions make it easier to capture additional structure in musical 
rhythm (such as "weak" positions are followed more likely by "strong" positions). On the 
other hand, the number of model parameters rapidly increases and one has to be more 
cautious in order to avoid overfitting. 

For score typesetting, we need to quantize note durations as well, i.e., associate note 
offsets with quantization locations. A simple way of accomplishing this is to define an 
indicator sequence uq : k that identifies whether y^ is an onset (v,k = 1) or an offset (v,k = 
0). Given u^, we can redefine the observation model as p{yk\T~k, Uk) = u kN(0, R) + (1 — 
Uk)Af(0,R o g) where R g is the observation noise associated with offsets. A typical model 
would have i? g 3> R. For i? g — > 00, the offsets would have no effect on the tempo process. 
Moreover, since Uk are always observed, this extension requires just a simple lookup. 

In principle, one must allow for arbitrary long intervals between onsets, hence 7^ are 
drawn from an infinite (but discrete) set. In our subsequent derivations, we assume that the 
number of possible intervals is fixed a-priori. Given an estimate of z^-i and observation y^, 
almost all of the virtually infinite number of choices for 7^ will have almost zero probability 
and it is easy to identify candidates that would have significant probability mass. 

Conceptually, all of the above listed extensions are easy to incorporate into the model 
and none of them introduces a fundamental computational difficulty to the basic problems 
of quantization and tempo tracking. 
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2.4 Problem Definition 

Given the model, we define rhythm transcription, i.e., quantization as a MAP state estima- 
tion problem 

li-.K = argmaxj;(7i :X |y :x) (9) 

71: if 

P(ll:K\yO:K) = J dz 0:K p(j 1:K , Z 0:K \y 0:K ) 

and tempo tracking as a filtering problem 

z k = ^g ma ^^P(li:k,z k \y 0:k ) (10) 

Zk 71:* 

The quantization problem is a smoothing problem: we wish to find the most likely score 
1\-K gi yen an the onsets in the performance. This is useful in "offline" applications such as 
score typesetting. 

For real-time interaction, we need to have an online estimate of the tempo/beat z k . 
This information is carried forth by the filtering density p(7i : fc, Zk\yo-.k) i n Eq.10. Our 
definition of the best tempo z^ as the maximum is somewhat arbitrary. Depending upon 
the requirements of an application, one can make use of other features of the filtering 
density. For example, the variance of ^ p(7i:Jfe 5 Zk\yo :k ) can be used to estimate "amount 
of confidence" in tempo interpretation or arg max 2jfej7lijfe p^i-.k, Zk\yo-.k) to estimate most 
likely score-tempo pair so far. 

Unfortunately, the quantities in Eq. 9 and Eq. 10 are intractable due to the explosion in 
the number of mixture components required to represent the exact posterior at each step k 
(See Figure 2). For example, to calculate the exact posterior in Eq. 9 we need to evaluate 
the following expression: 

P(ll:K\yO:K) = ^ J dz 0:K p(y 0:K \z 0: K , H:k)p(zO:K \71:k)p{H:K ) (H) 
= TfP(yO:K\ri:K)p(ll:K) (12) 

where the normalization constant is given by Z = p(yo-.K) = S 7l K p{yo:K\'Yi:K)p{'Yi:K)- For 
each trajectory ji-.k, the integral over zq : k can be computed stepwise in k by the Kalman 
filter (See appendix B.l). However, to find the MAP state of Eq. 11, we need to evaluate 
p(yo-.K\ji:K) independently for each of the exponentially many trajectories. Consequently, 
the quantization problem in Eq. 9 can only be solved approximately. 

For accurate approximation, we wish to exploit any inherent independence structure of 
the exact posterior. Unfortunately, since z and c are integrated over, all j k become coupled 
and in general p(7i : k|?/0:k) does not possess any conditional independence structure (e.g., 
a Markov chain) that would facilitate efficient calculation. Consequently, we will resort to 
numerical approximation techniques. 

3. Monte Carlo Simulation 

Consider a high dimensional probability distribution 

p(x) = ^j;*(x) (13) 
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Figure 2: Example demonstrating the explosion of the number of components to represent the 
exact posterior. Ellipses denote the conditional marginals p(Tk,uJk\co:k,yo:k)- (We show 
the period in logarithmic scale where w& = log 2 A^). In this toy example, we assume 
that a score consists only of notes of length i and J, i.e., 7^ can be either 1/2 or 1. 
(a) We start with a unimodal posterior p(tq, ojo\co, Vo), e.g., a Gaussian centered at 
(t, ui) = (0,0). Since we assume that a score can only consist of eight- and quarter 
notes, i.e., 7^ € {1/2, 1}. the predictive distribution p(ri,aJi|co : i,i/o) is bimodal where 
the modes are centered at (0.5,0) and (1,0) respectively (shown with a dashed contour 
line) . Once the next observation y\ is observed (shown with a dashed vertical line around 
r = 0.5), the predictive distribution is updated to yield p(ri , ui\co-i , yo-.i) ■ The numbers 
denote the respective log-posterior weight of each mixture component, (b) The predictive 
distribution pfo, uJ2\co-.i, Uo-.i) at step k = 2 has now 4 modes, two for each component 
of p(ti,cji|co:i,2/o:i). (c) The number of components grows exponentially with k. 



53 



Cemgil & Kappen 



where the normalization constant Z = J dxp*(x) is not known but p*(x) can be evaluated 
at any particular x. Suppose we want to estimate the expectation of a function /(x) under 
the distribution p(x) denoted as 

(/(x)) p (x) = j dxf{x)p(x) 

e.g., the mean of x under p(x) is given by (x). The intractable integration can be approxi- 
mated by an average if we can find N points xW , i = 1 . . . N from p(x) 

(/w)p(x)«^E/( x(1) ) ( 14 ) 

i=l 

When xW are generated by independently sampling from p(x), it can be shown that as N 
approaches infinity, the approximation becomes exact. 

However, generating independent samples from p(x) is a difficult task in high dimen- 
sions but it is usually easier to generate dependent samples, that is we generate x( l+1 ) by 
making use of x^. It is somewhat surprising, that even if x^ and x( i+1 ) are correlated 
(and provided ergodicity conditions are satisfied), Eq. 14 remains still valid and estimated 
quantities converge to their true values when number of samples N goes to infinity. 

A sequence of dependent samples xW is generated by using a Markov chain that has 
the stationary distribution p(x). The chain is defined by a collection of transition proba- 
bilities, i.e., a transition kernel T(x( i+1 ) |x^). The definition of the kernel is implicit, in 
the sense that one defines a procedure to generate the x( l+1 ) given x^. The Metropolis 
algorithm (Metropolis & Ulam, 1949; Metropolis, Rosenbluth, Rosenbluth, Teller, & Teller, 
1953) provides a simple way of defining an ergodic kernel that has the desired stationary 
distribution p(x). Suppose we have a sample xW . A candidate x' is generated by sam- 
pling from a symmetric proposal distribution </(x'|xW) (for example a Gaussian centered 
at xW). The candidate x' is accepted as the next sample x( l+1 ) if p(x') > p(xW). If x' 
has a lower probability, it can be still accepted, but only with probability p(x')/p(xW). 
The algorithm is initialized by generating the first sample x^ according to an (arbitrary) 
proposal distribution. 

However for a given transition kernel T, it is hard to assess the time required to converge 
to the stationary distribution so in practice one has to run the simulation until a very large 
number of samples have been obtained, (see e.g., Roberts & Rosenthal, 1998). The choice 
of the proposal distribution q is also very critical. A poor choice may lead to the rejection of 
many candidates x' hence resulting in a very slow convergence to the stationary distribution. 

For a large class of probability models, where the full posterior p(x) is intractable, one 
can still efficiently compute marginals of form p(:£fc|x_&), x_& = x\ . . . a^-i, 2^+1, . . . xk 
exactly. In this case one can apply a more specialized Markov chain Monte Carlo (MCMC) 
algorithm, the Gibbs sampler given below. 

1. Initialize x^) K by sampling from a proposal q{x\ : K) 

2. For i = ... N - 1 
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• For k = 1, . . . , K , Sample 

~ P^fcri'it-H X k + 1:K) ( 15 ) 

In contrast to the Metropolis algorithm, where the new candidate is a vector x', the 
Gibbs sampler uses the exact marginal p(xk\*—k) as the proposal distribution. At each 
step, the sampler updates only one coordinate of the current state x, namely x^, and the 
new candidate is guaranteed to be accepted. 

Note that, in principle we don't need to sample x^ sequentially, i.e., we can choose k 
randomly provided that each slice is visited equally often in the limit. However, a deter- 
ministic scan algorithm where k = 1, . . . K, provides important time savings in the type of 
models that we consider here. 

3.1 Simulated Annealing and Iterative Improvement 

Now we shift our focus from sampling to MAP state estimation. In principle, one can use the 
samples generated by any sampling algorithm (Metropolis-Hastings or Gibbs) to estimate 
the MAP state x* of p(x) by argmaxp(xW). However, unless the posterior is very much 

i=l:N 

concentrated around the MAP state, the sampler may not visit x* even though the samples 
x^ are obtained from the stationary distribution. In this case, the problem can be simply 
reformulated to sample not from p(x) but from a distribution that is concentrated at local 
maxima of p(x). One such class of distributions are given by p p . (x) oc p(x) p J . A sequence of 
exponents p\ < p2 < ■ ■ ■ < Pj < ■ ■ ■ is called to be a cooling schedule or annealing schedule 
owing to the inverse temperature interpretation of pj in statistical mechanics, hence the 
name Simulated Annealing (SA) (Aarts & van Laarhoven, 1985). When pj — > oo sufficiently 
slowly in j, the cascade of MCMC samplers each with the stationary distribution p Pj (x) is 
guaranteed (in the limit) to converge to the global maximum of p(x). Unfortunately, for this 
convergence result to hold, the cooling schedule must go very slowly (in fact, logarithmically) 
to infinity. In practice, faster cooling schedules must be employed. 

Iterative improvement (II) (Aarts & van Laarhoven, 1985) is a heuristic simulated an- 
nealing algorithm with a very fast cooling schedule. In fact, pj = oo for all j. The eventual 
advantage of this greedy algorithm is that it converges in a few iterations to a local max- 
imum. By restarting many times from different initial configurations x, one hopes to find 
different local maxima of p(x) and eventually visit the MAP state x*. In practice, by using 
the II heuristic one may find better solutions than SA for a limited computation time. 

From an implementation point of view, it is trivial to convert MCMC code to SA (or II) 
code. For example, consider the Gibbs sampler. To implement SA, we need to construct 
a cascade of Gibbs samplers, each with stationary distribution p(x)^ . The exact one time 
slice marginal of this distribution is p{xk\x.-k) Pj ■ So, SA just samples from the actual 
(temperature=l) marginal p{xk\x—k) raised to a power pj. 

3.2 The Switching State Space Model and MAP Estimation 

To solve the rhythm quantization problem, we need to calculate the MAP state of the 
posterior in Eq. 11 

P(ll:K\yO:K) OC p(jl:K) / dz 0: KP{yO:K \zq-.K, 7l:K)p(zO:K \ll:K) (16) 
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This is a combinatorial optimization problem: we seek the maximum of a functionp(7i : ^|yo:K) 
that associates a number with each of the discrete configurations 'Ji-.k- Since it is not feasible 
to visit all of the exponentially many configurations to find the maximizing configuration 
7*-io we wn l resor t t° stochastic search algorithms such as simulated annealing (SA) and 
iterative improvement (II). Due to the strong relationship between the Gibbs sampler and 
SA (or II), we will first review the Gibbs sampler for the switching state space model. 

The first important observation is that, conditioned on 71^, the model becomes a linear 
state space model and the integration on zq : k can be computed analytically using Kalman 
filtering equations. Consequently, one can sample only 71^ and integrate out z. The 
analytical marginalization, called Rao-Blackwellization (Casella &: Robert, 1996), improves 
the efficiency of the sampler (e.g., see Doucet, de Preitas, Murphy, & Russell, 2000a). 

Suppose now that each switch variable 7^ can have S distinct states and we wish to 
generate N samples (i.e trajectories) {'Ji) K ,i = 1...N}. A naive implementation of the 
Gibbs sampler requires that at each step k we run the Kalman filter S times on the whole 

observation sequence uq-.k to compute the proposal p{ r ik\l\\-i-,l^k+ vk-> Vo-.k)- This would 
result in an algorithm of time complexity 0(NK 2 S) that is prohibitively slow when K is 
large. Carter and Kohn (1996) have proposed a much more time efficient deterministic scan 
Gibbs sampler that circumvents the need to run the Kalman filtering equations at each 
step k on the whole observation sequence yo-.K- See also (Doucet & Andrieu, 2001; Murphy, 
2002). 

The method is based on the observation that the proposal distribution p(^k\ ) can 
be factorized as a product of terms that either depend on past observations yo:/fc or t ne 
future observations yu+i-.K- So the contribution of the future can be computed a-priori by 
a backward filtering pass. Subsequently, the proposal is computed and samples 7^ are 
generated during the forward pass. The sampling distribution is given by 

P{lk\l-k,y0:K) OCp(j k \l-k)p(y0:K\ll:K) (17) 

where the first term is proportional to the joint prior p^kll-k) °^ Pilk-,1 -k)- The second 
term can be decomposed as 

P{y0:K\ll:K) = J dz k p(y k+ l:K\yO:k, *k, ll:K)p{y0:k, %k \11-.k) (18) 
= I dz k p(y k+ l:K\Zk,Jk+l:K)p(yO:k,Zk\jl:k) (19) 

Both terms are (unnormalized) Gaussian potentials hence the integral can be evaluated 
analytically. The term p(yk+i:K\ z ki7k+i:K) is an unnormalized Gaussian potential in z k and 
can be computed by backwards filtering. The second term is just the filtering distribution 
p( z k\yo:ki 7i:jfc) scaled by the likelihood p{yo:k\li:k) an d can be computed during forward 
filtering. The outline of the algorithm is given below, see the appendix B.l for details. 

1. Initialize 7^ by sampling from a proposal qi^i-.x) 

2. For i = 1 ... N 

• For k = K -1,...,0, 
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- Compute p{yk+i:K\zk,T k+ i :K ) 
• For k = 1,..., K, 

- For a = 1 ... S 

* Compute the proposal 

P(lk = 4) ^Phk = s,7-fc) y ^^(yoiJfc^JfclTil-uTA; = s^^+iiKkfe^iVi 

- Sample 7^ from p{^k\ m ) 

The resulting algorithm has a time complexity of O(NKS), an important saving in terms 
of time. However, the space complexity increases from 0(1) to 0(K) since expectations 
computed during the backward pass need to be stored. 

At each step, the Gibbs sampler generates a sample from a single time slice k. In 
certain types of "sticky" models, such as when the dependence between 7^ and 7^+1 is 
strong, the sampler may get stuck in one configuration, moving very rarely. This is due to 
the fact that most singleton flips end up in low probability configurations due to the strong 
dependence between adjacent time slices. As an example, consider the quantization model 
and two configurations [. . . 7^, 7^+1 . . . ] = [...1,1...] and [. . . 3/2, 1/2 ... ]. By updating 
only a single slice, it may be difficult to move between these two configurations. Consider 
an intermediate configuration [. . . 3/2, 1 . . . ]. Since the duration (7^ + 7^+1) increases, all 
future quantization locations c^-.k are shifted by 1/2. That may correspond to a score that 
is heavily penalized by the prior, thus "blocking" the path. 

To allow the sampler move more freely, i.e., to allow for more global jumps, one can 
sample from L slices jointly. In this case the proposal distribution takes the form 

p(jk:k+L-l\-) OC p(j k :k+L-l,J-(k:k+L-l)) X 

J d2jfe + L-lp(yO:fc+L-l,2jfe+^ 

Similar to the one slice case, terms under the integral are unnormalized Gaussian potentials 
(on Zk +l-i) representing the contribution of past and future observations. Since 7fc:A;+L-i 
has S L states, the resulting time complexity for generating N samples is 0(NKS L ), thus in 
practice L must be kept rather small. One remedy would be to use a Metropolis-Hastings 
algorithm with a heuristic proposal distribution q{"1k:k+L-i\yo:K) to circumvent exact cal- 
culation, but it is not obvious how to construct such a q. 

One other shortcoming of the Gibbs sampler (and related MCMC methods) is that the 
algorithm in its standard form is inherently offline; we need to have access to all of the 
observations y$ : K to start the simulation. For certain applications, e.g., automatic score 
typesetting, a batch algorithm might be still feasible. However in scenarios that require 
real-time interaction, such as in interactive music performance or tempo tracking, online 
methods must be used. 

3.3 Sequential Monte Carlo 

Sequential Monte Carlo, a.k.a. particle filtering, is a powerful alternative to MCMC for 
generating samples from a target posterior distribution. SMC is especially suitable for 
application in dynamical systems, where observations arrive sequentially. 
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The basic idea in SMC is to represent the posterior p{xo-.k-i\yo:k-i) a t time k — 1 by 
a (possibly weighted) set of samples = 1 . . . N} and extend this representation 

to {{xq) i ,_ 1 , x^), i = 1 . . . N} when the observation y k becomes available at time k. The 
common practice is to use importance sampling. 

3.3.1 Importance Sampling 

Consider again a high dimensional probability distributionp(x) = p*(x)/Z with an unknown 
normalization constant. Suppose we are given a proposal distribution q(x) that is close to 
p(x) such that high probability regions of both distributions fairly overlap. We generate 
independent samples, i.e., particles, from the proposal such that q(x) ~ YliLi <^( x ~~ 
xW)/JV. Then we can approximate 

1 p*(x) 

J;(x) = -zW) q{) 

Zg(x)iV^' (X ) 

w (i) 

* £-#V x - x(<,) 

i= i E i= i ™ w 

where w;W = p*(xW)/g(xW) are the importance weights. One can interpret as correc- 
tion factors to compensate for the fact that we have sampled from the "incorrect" distri- 
bution q{x). Given the approximation in Eq.22 we can estimate expectations by weighted 
averages 

N 

</(*)>„(x) « $>«/(x«) (23) 

where = w^/Y^f = \ w ^ are the normalized importance weights. 

3.3.2 Sequential Importance Sampling 

Now we wish to apply importance sampling to the dynamical model 

K 

p{xq:k\vq:k) oc JJ p(y k \x k )p{x k |a;o:fc-i) (24) 

where a; = {z, 7}. In principle one can naively apply standard importance sampling by using 
an arbitrary proposal distribution q{xQ : x)- However finding a good proposal distribution 
can be hard if K 3> 1. The key idea in sequential importance sampling is the sequential 
construction of the proposal distribution, possibly using the available observations yo :k , i.e., 

K 

q(X():K\yO:K) = q{x k \xo :k — i,yo :k ) 

k=0 
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Given a sequentially constructed proposal distribution, one can compute the importance 
weight recursively as 

q{Xo; k \yO:k) 9(4 |Z0:fc-lM):/fe) ^K/fe-l IS/0:*-l) 

= — ; — k - 1 (26) 

The sequential update schema is potentially more accurate than naive importance sam- 
pling since at each step k, one can generate a particle from a fairly accurate proposal 
distribution that takes the current observation y k into account. A natural choice for the 
proposal distribution is the filtering distribution given as 

q( x k\xo±-iyo--k) =p( x k\xo) k _ v yo:k) (27) 
In this case the weight update rule in Eq. 26 simplifies to 



In fact, provided that the proposal distribution q is constructed sequentially and past sam- 
pled trajectories are not updated, the filtering distribution is the optimal choice in the sense 
of minimizing the variance of importance weig hts ujW (Doucet, Godsill, & Andrieu, 2000b). 
Note that Eq. 27 is identical to the proposal distribution used in Gibbs sampling at k = K 
(Eq 15). At k < K, the SMC proposal does not take future observations into account; so 
we introduce discount factors w k to compensate for sampling from the wrong distribution. 

3.3.3 Selection 

Unfortunately, the sequential importance sampling may be degenerate, in fact, it can be 
shown that the variance of w£ increases with k. In practice, after a few iterations of 
the algorithm, only one particle has almost all of the probability mass and most of the 
computation time is wasted for updating particles with negligible probability. 

To avoid the undesired degeneracy problem, several heuristic approaches are proposed 
in the literature. The basic idea is to duplicate or discard particles according to their 
normalized importance weights. The selection procedure can be deterministic or stochas- 
tic. Deterministic selection is usually greedy; one chooses N particles with the highest 
importance weights. In the stochastic case, called resampling, particles are drawn with a 

(i) 

probability proportional to their importance weight w k . Recall that normalized weights 
{w^\i = 1 . . . N} can be interpreted as a discrete distribution on particle labels (i). 

3.4 SMC for the Switching State Space Model 

The SIS algorithm can be directly applied to the switching state space model by sampling 
directly from x k = (^,7^). However, the particulate approximation can be quite poor if z 
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Figure 3: Outline of the algorithm. The ellipses correspond to the conditionals 
p{ z k\lk\ Ua-.k)- Vertical dotted lines denote the observations y^. At each step 
k, particles with low likelihood are discarded. Surviving particles are linked to 
their parents. 



is high dimensional. Hence, too many particles may be needed to accurately represent the 
posterior. 

Similar to the MCMC methods introduced in the previous section, efficiency can be 
improved by analytically integrating out Zo : k and only sampling from 7 1: ^. In fact, this 
form of Rao-Blackwellization is reported to give superior results when compared to standard 
particle filtering where both 7 and z are sampled jointly (Chen & Liu, 2000; Doucet et al., 
2000b). The improvement is perhaps not surprising, since importance sampling performs 
best when the sampled space is low dimensional. 

The algorithm has an intuitive interpretation in terms of a randomized breadth first tree 
search procedure: at each new step k, we expand N kernels to obtain S x N new kernels. 
Consequently, to avoid explosion in the number of branches, we select N out of S x N 
branches proportional to the likelihood, See Figure 3. The derivation and technical details 
of the algorithm are given in the Appendix C. 

The tree search interpretation immediately suggests a deterministic version of the al- 
gorithm where one selects (without replacement) the N branches with highest weight. We 
will refer to this method as a greedy filter (GF). The method is also known as split-track 
filter (Chen & Liu, 2000) and is closely related to Multiple Hypothesis Tracking (MHT) 
(Bar-Shalom Sz Fortmann, 1988). One problem with the greedy selection schema of GF is 
the loss of particle diversity. Even if the particles are initialized to different locations in zo, 
(e.g., to different initial tempi), mainly due to the discrete nature of the state space of 7^, 
most of the particles become identical after a few steps k. Consequently, results can not 
be improved by increasing the number of particles N. Nevertheless, when only very few 
particles can be used, say e.g., in a real time application, GF may still be a viable choice. 
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Figure 4: A hypothetical situation where neither of the two particles 7^5 is optimal. We 
would obtain eventually a higher likelihood configuration by interchanging 73 
between particles. 



3.5 SMC and estimation of the MAP trajectory 

Like MCMC, SMC is a sampling method. Hence comments made in Section 3.1 about the 
eventual suboptimality of estimating the MAP trajectory from particles as arg maxp(7^ Iuo-.k) 
also apply here. An hypothetical situation is shown in figure 4. 

One obvious solution is to employ the SA "trick" and raise the proposal distribution to 
a power p(jk\'V ■ However, such a proposal will be peaked on a very few 7 at each time 
slice. Consequently, most of the particles will become identical in time and the algorithm 
eventually degenerates to greedy filtering. 

An algorithm for estimating the MAP trajectory from a set of SMC samples is recently 
proposed in the literature (Godsill, Doucet, & West, 2001). The algorithm relies on the 

(i) 

observation that once the particles x, are sampled during the forward pass, one is left with 
a discrete distribution defined on the (discrete) support X\-k = ®jfcLi ^-k- Here X k denotes 
is the support of the filtering distribution a time k and is the Cartesian product between 
sets. Formally, X k is the set of distinct samples at time k and is given by X k = \Ji{x^}. 

The distribution p{X\ : x \ Ui-.k) 3 is Markovian because the original state transition model 
is Markovian, i.e., the posterior can be represented exactly by 

K 

p{Xi,k\vv,k) oc \\p{yk\X k )p{X k \X k _i) 
k=\ 

Consequently, one can find the best MAP trajectory arg maxp(Xi : ^-) by using an algorithm 
that is analogous to the Viterbi algorithm for hidden Markov models (Rabiner, 1989). 

However, this idea does not carry directly to the case when one applies Rao-Black- 
wellization. In general, when a subset of the hidden variables is integrated out, all time 

slices of the posterior p(Tv.K\Vi:k) are coupled, where T\ : k = ® k =\ ^k an d T k = U»{7* }■ 
One can still employ a chain approximation and run Viterbi, (e.g., Cemgil & Kappen, 
2002), but this does not guarantee to find arg maxp(ri : ^-|yi :J t). 

(i) 

On the other hand, because 7^ are drawn from a discrete set, several particles become 
identical so T k has usually a small cardinality when compared to the number of particles 
N. Consequently, it becomes feasible to employ SA or II on the reduced state space Ti^; 
possibly using a proposal distribution that extends over several time slices L. 

3. By a slight abuse of notation we use the symbol Xk both as a set and as a general element when used 
in the argument of a density, p(yk\Xk) means p(yk\xk) s.t. a& € Xk 
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In practice, for finding the MAP solution from the particle set {7}.^,? = 1 . . . N}, we 
propose to find the best trajectory i* = argmaxjp(yo:i<"|7j!^)p(7j!^) and apply iterative 
improvement starting from the initial configuration 7^- 



4. Simulations 

We have compared the inference methods in terms of the quality of the solution and exe- 
cution time. The tests are carried out both on artificial and real data. 

Given the true notation 7^ e , we measure the quality of a solution in terms of the 
log-likelihood difference 

Ar = , P{yO:K\ll:K)p{ll:K) 

and in terms of edit distance 

K 

e( ll:K ) = J2(l-S(lk-ir e )) 
k=l 

The edit distance e(7i : x) gives simply the number of notes that are quantized wrongly. 



4.1 Artificial data: Clave pattern 

The synthetic example is a repeating "son-clave" pattern IM J J M 1 1 J : ll(c = [1, 2, 4, 5.5, 
7 . . . ]) with fluctuating tempo. We repeat the pattern 6 times and obtain a score ji-.k with 
K = 30. 

Such syncopated rhythms are usually hard to transcribe and make it difficult to track 
the tempo even for experienced human listeners. Moreover, since onsets are absent at 
prominent beat locations, standard beat tracking algorithms usually loose track. 

Given score ji-.k, we have generated 100 observation sequences yo-.K by sampling from 
the tempo model in Eq. 7. We have parameterized the observation noise variance 4 as 
Q = IhQa + Qb- in this formulation, the variance depends on the length of the interval 
between consecutive onsets; longer notes in the score allow for more tempo and timing 
fluctuation. For the tests on the clave example we have not used a prior model that reflects 
true source statistics, instead, we have used the generic prior model defined in Section 2.1 
with A = 1. 

All the example cases are sampled from the same score (clave pattern). However, due 
to the use of the generic prior (that does not capture the exact source statistics well) and a 
relatively broad noise model, the MAP trajectory 7*.^ given yQ : x is not always identical to 
the original clave pattern. For the «'th example, we have defined the "ground truth" 7 1 !^ ! ' 2 as 
the highest likelihood solution found using any sampling technique during any independent 
run. Although this definition of the ground truth introduces some bias, we have found 
this exercise more realistic as well as more discriminative among various methods when 
compared to, e.g.,, using a dataset with essentially shorter sequences where the exact MAP 

4. The noise covariance parameters were R = 0.02 2 , Q a = 0.06 2 / and Qb = 0.02 2 /. I is a 2 x 2 identity 
matrix. 
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trajectory can be computed by exhaustive enumeration. The wish to stress that the main 
aim of the simulations on synthetic dataset is to compare effectiveness of different inference 
techniques; we postpone the actual test whether the model is a good one to our simulations 
on real data. 

We have tested the MCMC methods, namely Gibbs sampling (Gibbs), simulated an- 
nealing (SA) and iterative improvement (II) with one and two time slice optimal proposal 
and for 10 and 50 sweeps. For each onset the optimal proposal pi'jkl') is computed 
always on a fixed set, T = {0, 1/4, 2/4 . . . 3}. Figure 6 shows a typical run of MCMC. 

Similarly, we have implemented the SMC for N = {1,5,10,50,100} particles. The 
selection schema was random drawing from the optimal proposal p(jk\') computed using 
one or two time slices. Only in the special case of greedy filtering (GF), i.e., when N = 1, we 
have selected the switch with maximum probability. An example run is shown in Figure 5. 

We observe that on average SMC results are superior to MCMC (Figure 7). We observe 
that, increasing the number of sweeps for MCMC does not improve the solution significantly. 
On the other hand, increasing the number of particles seems to improve the quality of the 
SMC solution monotonically. Moreover, the results suggest that sampling from two time 
slices jointly (with the exception of SA ) does not have a big effect. GF outperforms a 
particle filter with 5 particles that draws randomly from the proposal. That suggests that 
for PF with a small number of particles N, it may be desirable to use a hybrid selection 
schema that selects the particle with maximum weight automatically and randomly selects 
the remaining N — 1. 

We compare inference methods in terms of execution time and the quality of solutions (as 
measured by edit distance). As Figure 8 suggests, using a two slice proposal is not justified. 
Moreover it seems that for comparable computational effort, SMC tends to outperform all 
MCMC methods. 

4.2 Real Data: Beatles 

We evaluate the performance of the model on polyphonic piano performances. 12 pianists 
were invited to play two Beatles songs, Michelle and Yesterday. Both pieces have a relatively 
simple rhythmic structure with ample opportunity to add expressiveness by fluctuating the 
tempo. The original score is shown in Figure 9(a). The subjects had different musical edu- 
cation and background: four professional jazz players, four professional classical performers 
and four amateur classical pianists. Each arrangement had to be played in three tempo 
conditions, three repetitions per tempo condition. The tempo conditions were normal, slow 
and fast tempo, all in a musically realistic range and all according to the judgment of the 
performer. Further details are reported in (Cemgil et al., 2001). 

4.2.1 Preprocessing 

The original performances contained several errors, such as missing notes or additional notes 
that were not on the original score. Such errors are eliminated by using a matching tech- 
nique (Heijink, Desain, & Honing, 2000) based on dynamical programming. However, visual 
inspection of the resulting dataset suggested still several matching errors that we interpret 
as outliers. To remove these outliers, we have extended the quantization model with a two 
state switching observation model, i.e., the discrete space consists of (jk^k)- I n this simple 
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Figure 5: Particle filtering on clave example with 4 particles. Each circle denotes the mean 
(tju , cd^ ) where (Ji^ = log 2 A^. The diameter of each particle is proportional 
to the normalized importance weight at each generation. '*' denote the true 
(t,ui) pairs; here we have modulated the tempo deterministically according to 
ui% = 0.3sin(27rc/j/32), observation noise variance is R = 0.025 2 . 
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gure 6: Typical runs of Gibbs sampling, Simulated Annealing (SA) and Iterative Improvement 
(II) on clave example. All algorithms are initialized to the greedy filter solution. The 
annealing schedule for SA was linear from pi =0.1 to P33 = 10 and than proceeding de- 
terministically by ^34:50 = 00. When SA or II converge to a configuration, we reinitialize 
by a particle filter with one particle that draws randomly proportional to the optimal 
proposal. Sharp drops in the likelihood correspond to reinitializations. We see that, at 
the first sweep, the greedy filter solution can only be slightly improved by II. Conse- 
quently the sampler reinitializes. The likelihood of SA drops considerably, mainly due to 
the high temperature, and consequently stabilizes at a suboptimal solution. The Gibbs 
sampler seems to explore the support of the posterior but is no able to visit the MAP 
state in this run. 
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Figure 7: Comparison of inference methods on the clave data. The squares and ovals denote the 
median and the vertical bars correspond to the interval between %25 and %75 quantiles. 
We have tested the MCMC methods (Gibbs, SA and II) independently for 10 and 50 
(shown from left to right). The SMC methods are the greedy filter (GF) and particle filter 
(PF). We have tested filters with N = {5,10,50,100} particles independently (shown 
from left to right.). 
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Figure 8: Comparison of execution time in terms of floating point operations. For all methods, the 
first number (1 or 2) denotes the number slices used by the optimal proposal distribution. 
For the particle filter (PF), the second number denotes the number of particles. The 
dashed lines are merely used to connect related methods. 



outlier detection mechanism, each switch is a binary indicator variable specifying whether 
the onset y& is an outlier or not. We assume that all indicators are independent a-priori 
and have a uniform prior. The observation model is given by p(yk\iki T k) = N(0,Ri k ) 5 . 
Since the score ji-.k is known, the only unknown discrete quantities are the indicators io-.K- 
We have used greedy filtering followed by iterative improvement to find the MAP state 
of indicators Iq : k and eliminated outliers in our further studies. For many performances, 
there were around 2 — 4 outliers, less than 1% of all the notes. The resulting dataset can 
be downloaded from the url http://www.snn.kun.nl/~cemgil. 

4.2.2 Parameter Estimation 

We have trained tempo tracking models with different dimensionality D, where D denotes 
the dimension of the hidden variable z. In all of the models, we use a transition matrix that 
has the form in Eq. 8. 

Since the true score is known, i.e., the quantization location c& of each onset y& is given, 
we can clamp all the discrete variables in the model. Consequently, we can estimate the 
observation noise variance R, the transition noise variance Q and the transition matrix 
coefficients A from data. 

We have optimized the parameters by Expectation-Maximization (EM) for the linear 
dynamical systems (Shumway &: Stoffer, 1982; Ghahramani &: Hinton, 1996) using all perfor- 



5. We took R ik =o = 0.002 and Ri h=1 = 2. 
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mances of "Yesterday" as training data. Similarly, the score prior parameters are estimated 
by frequency counts from the score of "Yesterday" 6 . All tests are carried out on "Michelle" . 

4.2.3 Results 

In Figure 9 we show the result of typesetting a performance with and without tempo 
tracking. Due to fluctuations in tempo, the quality of the automatically generated score is 
very poor. The quality can be significantly improved by using our model. 

Figure 10 shows some tempo tracking examples on Michelle dataset for pianists from 
different background and training. We observe that in most cases the results are satisfactory. 

In Figure 11, we give a summary of test results on Michelle data in terms of the log- 
likelihood and edit distance as a function of model order and number of particles used for 
inference. Figure 11(a) shows that the median likelihood on test data is increasing with 
model order. This suggests that a higher order filter is able to capture structure in pi- 
anists' expressive timing. Moreover, as for the sythetic data, we see a somewhat monotonic 
increase in the likelihood of solutions found when using more particles. 

The edit distance between the original score and the estimates are given in Figure 11(b). 
Since both pieces are arranged for piano, due to polyphony, there are many onsets that are 
associated with the same quantization location. Consequently, many 7£ me in the original 
score are effectively zero. In such cases, typically, the corresponding inter onset interval 
Vk — Vk-i is also very small and the correct quantization (namely 7^ = 0) can be identified 
even if the tempo estimate is completely wrong. As a consequence, the edit distance remains 
small. To make the task slightly more challenging, we exclude the onsets with 7^ me = 
from edit distance calculation. 

We observe that the extra prediction ability obtained using a higher order model does 
not directly translate to a better transcription. The errors are around 5% for all models. 
On the other hand, the variance of edit distance for higher order models is smaller. This 
suggests that higher order models tend to be more robust against divergence from the 
original tempo track. 

5. Discussion 

We have presented a switching state space model for joint rhythm quantization and tempo 
tracking. The model describes the rhythmic structure of musical pieces by a prior distri- 
bution over quantization locations. In this representation, it is easy to construct a generic 
prior that prefers simpler notations and to learn parameters from a data set. The prior on 
quantization locations cq : k translates to a non-Markovian distribution over a score ^\ : k- 

Timing deviations introduced by performers (tempo fluctuation, accentuations and mo- 
tor errors) are modeled as independent Gaussian noise sources. Performer specific timing 
preferences are captured by the parameters of these distributions. 

Given the model, we have formulated rhythm quantization as a MAP state estimation 
problem and tempo tracking as a filtering problem. We have introduced Markov chain 

6. The maximum likelihood parameters for a model of dimension D = 3 are found to be: a = —0.072, R = 
0.013 2 and q T = 0.008 2 , q Al = 0.007 2 and qa 2 = 0.050 2 . The prior p(c) is p(0) = 0.80, p(l/3) = 0.0082, 
p(l/2) = 0.15 p(5/6) = 0.0418. Remaining p(c) are set to 10~ 6 . 
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Michelle 




(b) Typesetting without 
processing by the model. 
Due to fluctuations in 
tempo, the quality of the 
score is poor. 



(c) Typesetting after tempo 
tracking and quantization 
with a particle filter. 



Figure 9: Results of Typesetting the scores. 
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10 20 30 40 50 60 70 -10 10 20 30 40 50 60 70 80 



(a) Professional Jazz Pianist (b) Amateur 




-10 10 20 30 40 50 60 70 80 10 20 30 40 50 60 70 



(c) Professional Classical Pianist. The (d) Tracking at twice the rate of the 
filter temporarily loses track. original tempo. 



Figure 10: Examples of filtered estimates of Zo-k = [i~k, ^k] T from the Beatles data set. Circles 
denote the mean of p(2A|7°. r j. gmal , Uo-.k) and "x" denote mean p(^fc|7*.j., Uo-.k) obtained by 
SMC. It is interesting to note different timing characteristics. For example the classical 
pianist uses a lot more tempo fluctuation than the professional jazz pianist. Jazz pianist 
slows down dramatically at the end of the piece, the amateur "rushes", i.e., constantly 
accelerates at the beginning. The tracking and quantization results for (a) and (b) 
are satisfactory. In (a), the filter loses track at the last two notes, where the pianist 
dramatically slows down. In (c), the filter loses track but catches up again. In (d), the 
filter jumps to a metrical level that is twice as fast as the original performance. That 
would translate to a duplication in note durations only. 
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Model Dimension 



(a) Likelihood. The dashed horizontal line shows the median 
likelihood of the original score of Michelle under each model. 



Model Dimension 



(b) Edit Distance 



Figure 11: SMC results on the test data (108 performances of Michelle). For each model we show 
the results obtained with N = 1, 10,20 and 50 particles. The "-" show the median of 
the best particle and "x" denote the median after applying iterative improvement. The 
vertical bars correspond to the interval between %25 and %75 quantiles. 
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Monte Carlo (MCMC) and sequential Monte Carlo (SMC) to approximate the respective 
distributions. 

The quantization model we propose is similar to that of (Raphael, 2001a). For transcrip- 
tion, Raphael proposes to compute arg maxp(co : _ftr, zq : k\uq:k) and uses a message propaga- 
tion scheme that is essentially analogous to Rao-Blackwellized particle filtering. To prevent 
the number of kernels from explosion, he uses a deterministic selection method, called 
"thinning". The advantage of Raphael's approach is that the joint MAP trajectory can 
be computed exactly, provided that the continuous hidden state z is one dimensional and 
the model is in a parameter regime that keeps the number of propagated Gaussian kernels 
limited, e.g., if R is small, thinning can not eliminate many kernels. One disadvantage is 
that the number of kernels varies depending upon the features of the filtering distribution; 
it is difficult to implement such a scheme in real time. Perhaps more importantly, sim- 
ple extensions such as increasing the dimensionality of z or introducing nonlinear it ies to 
the transition model would render the approach quickly invalid. In contrast, Monte Carlo 
methods provide a generic inference technique that allow great flexibility in models one can 
employ. 

We have tested our method on a challenging artificial problem (clave example). SMC 
has outperformed MCMC in terms of the quality of solutions, as measured in terms of the 
likelihood as well as the edit distance. We propose the use of SMC for both problems. For 
finding the MAP quantization, we propose to apply iterative improvement (II) to the SMC 
solution on the reduced configuration space. 

The correct choice of the score prior is important in the overall performance of the 
system. Most music pieces tend to have a certain rhythmical vocabulary, that is certain 
rhythmical motives reoccur several times in a given piece. The rhythmic structure depends 
mostly upon the musical genre and composer. It seems to be rather difficult to devise 
a general prior model that would work well in a large spectrum of styles. Nevertheless, 
for a given genre, we expect a simple prior to capture enough structure sufficient for good 
transcription. For example, for the Beatles dataset, we have estimated the prior by counting 
from the original score of "Yesterday" . The statistics are fairly close to that of "Michelle" . 
The good results on the test set can be partially accounted for the fact that both pieces 
have a similar rhythmical structure. 

Conditioned on the score, the tempo tracking model is a linear dynamical system. We 
have optimized several tempo models using EM where we have varied the dimension of 
tempo variables z. The test results suggest that increasing the dimensionality of z improves 
the likelihood. However, increase in the likelihood of the whole dataset does not translate 
directly to overall better quantization results (as measured by edit distance). We observe 
that models trained on the whole training data fail consistently for some subjects, especially 
professional classical pianists. Perhaps interestingly, if we train "custom" models specifically 
optimized for the same subjects, we can improve results significantly also on test cases. 
This observation suggests a kind of multimodality in the parameter space where modes 
correspond to different performer regimes. It seems that a Kalman filter is able to capture 
the structure in expressive timing deviations. However, when averaged over all subjects, 
these details tend to be wiped out, as suggested by the quantization results that do not 
vary significantly among models of different dimensions. 
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A related problem with the edit distance measure is that under an "average" model, the 
likelihood of the desired score (e.g., original score of "Michelle") may have a lower likelihood 
than a solution found by an inference method. In such cases increasing the likelihood may 
even decrease the edit distance. In some test cases we even observe solutions with a higher 
likelihood than the original notation where all notes are wrong. In most of these cases, the 
tempo trajectory of the solution correspond to the half or twice of the original tempo so 
consequently all note durations are halved or doubled (e.g., all whole notes are notated as 
half notes, all half notes as quarters e.t.c). Considering the fact that the model is "self 
initializing" its tempo, that is we assume a broad uncertainty a-priori, the results are still 
satisfactory from a practical application perspective. 

One potential shortcoming of our model is that it takes only timing information of onsets 
into account. In reality, we believe that pitch and melodic grouping as well as articulation 
(duration between note onsets and offsets) and dynamics (louder or softer) provide useful 
additional information for tempo tracking as well as quantization. Moreover, current model 
assumes that all onsets are equally relevant for estimation. That is probably in general not 
true: for example, a kick-drum should provide more information about the tempo than a 
flute. On the other hand, our simulations suggest that even from such a limited model one 
can obtain quite satisfactory results, at least for simple piano music. 

It is somewhat surprising, that SMC, basically a method that samples from the filtering 
distribution outperforms an MCMC method such as SA that is specifically designed for 
finding the MAP solution given all observations. An intuitive explanation for relatively 
poorer MCMC results is that MCMC proceeds first by proposing a global solution and then 
tries to improve it by local adjustments. A human transcriber, on the other hand, would 
listen to shorter segments of music and gradually write down the score. In that respect, 
the sequential update schema of SMC seems to be more natural for the rhythm transcrip- 
tion problem. Similar results, where SMC outperforms MCMC are already reported in the 
literature, e.g., in the so-called "Growth Monte Carlo" for generating self-avoiding random 
walks (Liu, Chen, Sz Logvinenko, 2001). It seems that for a large class of dynamical prob- 
lems, including rhythm transcription, sequential updating is preferable over batch methods. 

We note that theoretical convergence results for SA require the use of a logarithmic 
cooling schedule. It seems that our cooling schedule was too fast to meet this requirement; 
so one has to be still careful in interpreting the poor performance as a negative SA result. 
We maintain that by using a richer neighborhood structure in the configuration space (e.g., 
by using a block proposal distribution) and a slower cooling schedule, SA results can be 
improved significantly. Moreover, MCMC methods can be also be modified to operate 
sequentially, for example see (Marthi, Pasula, Russell, &: Peres, 2002). 

Another family of inference methods for switching state space models rely on determinis- 
tic approximate methods. This family includes variational approximations (Ghahramani & 
Hinton, 1998) and expectation propagation (Heskes, 2002). It remains an interesting open 
question whether deterministic approximation methods provide an advantage in terms of 
computation time and accuracy; in particular for the quantization problem and for other 
switching state space models. A potential application of the deterministic approximation 
techniques in a MCMC schema can be in designing proposal distributions that extend over 
several time slices. Such a schema would circumvent the burden for computing the optimal 
proposal distribution exhaustively hence allowing more global moves for the sampler. 
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Our current results suggest the superiority of SMC for our problem. Perhaps the most 
important advantage of SMC is that it is essentially an "anytime" algorithm; if we have 
a faster computer we can increase the number of particles to make use of the additional 
computational power. When computing time becomes short one can decrease the number 
of samples. These features make SMC very attractive for real-time applications where one 
can easily tune the quality/computation-time tradeoff. 

Motivated by the practical advantages of SMC and our positive simulation results, we 
have implemented a prototype of SMC method in real-time. Our current computer system 
(a 800 MHz P3 laptop PC running MS Windows) allows us to use up to 5 particles with 
almost no delay even during busy passages. We expect to significantly improve the efficiency 
by translating the MATLAB® constructs to native C code. Hence, the method can be used 
as a tempo tracker in an automatic interactive performance system and as a quantizer in 
an automatic score typesetting program. 
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Appendix A. A generic prior model for quantization locations c 

In traditional western music notation, note durations are generated by recursive subdivisions 
starting from a whole note, hence it is also convenient to generate quantization locations 
in a similar fashion by regular subdivisions. We decompose a quantization location into an 
integer part and a fraction: c = |_cj + (c mod 1). For defining a prior, we will only use the 
fraction. 

The set of all fractions can be generated by recursively subdividing the unit interval 
[0,1). We let S = [si] denote a subdivision schema, where [si] is a (finite) sequence of 
arbitrary integers (usually small primes such as 2,3 or 5). The choice of a particular S 
depends mainly on the assumed time signature. We generate the set of fractions C as 
follows: At first iteration, we divide the unit interval into si intervals of equal length and 
append the endpoints c' of resulting intervals into the set C. At each following iteration i, 
we subdivide all intervals generated by the previous iteration into Sj equal parts and append 
all resulting endpoints to C. Note that this procedure generates a regular grid where two 
neighboring grid points have the distance 1/ s i- We denote the iteration number at which 
the endpoint c' is first inserted to C as the depth of c' (with respect to S). This number will 
be denoted as d(c'\S). It is easy to see that this definition of d coincides with the number 
of significant bits to represent c mod 1 when S = [2, 2, . . . ]. 

As an illustirative example consider the subdivision S = [3, 2]. At the first iteration, the 
unit interval is divided into s\ = 3 equal intervals, and the resulting endpoints 0, 1/3, and 
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2/3 are inserted into C with depths g?(0) = d(l/3) = c?(2/3) = 1. At the second iteration, 
the new endpoints 1/6, 3/6 and 5/6 are inserted to C and are assigned the depth 2. 
Given an 5, we can define a distribution on quantization locations 

p(ck\S) oc exp(— \d(ck mod 1|<S)) 

If we wish to consider several time signatures, i.e., different subdivision schemata, we can 
interpret S as a hidden indicator variable and define a prior p{S). In this case, the prior 
becomes a multinomial mixture given by p(cj,.) = Yls P( c k\^)p{S) . For further details and 
empirical results justifying such a choice see (Cemgil et al., 2000). 

Appendix B. Derivation of two pass Kalman filtering Equations 

Consider a Gaussian potential with mean \i and covariance £ defined on some domain 
indexed by x. 

</>{x) = Z x Af{n,H) = Z|2^S|-^exp(-^(a;-^) T E- 1 (a;-/i)) (28) 

where J dx<f>{x) = Z > 0. If Z = 1 the potential is normalized. The exponent in Eq. 28 is 
a quadratic form so the potential can be written as 

4>(x) = exp(c/ + h T x - ^x T Kx) (29) 

where 

K = V- 1 /i=5TV g = \ogZ + llog\^\-h T K- 1 h 

To denote a potential in canonical form we will use the notation 

4>(x) = Z x Af(n, E) = [h, K, g] 

and we will refer to g, h and K as canonical parameters. Now we consider a Gaussian 
potential on {x\,X2) T ■ The canonical representation is 

*(*i,s 2 ) = k kZ)^_ 

In models where several variables are interacting, one can find desired quantities by applying 
three basic operations defined on Gaussian potentials. Those are multiplication, condition- 
ing, and marginalization. The multiplication of two Gaussian potentials on the same index 
set x follows directly from Eq. 29 and is given by 

4>'{x) = 4> a {x) X (j) b (x) 
[h',K',g'\ = [h a ,K a ,g a ]x[h b ,K b ,g b ] = [h a + h b ,K a + K b ,g a +g b ] 

If the domain of cj) a and cj) b only overlaps on a subset, then potentials are extended to the 
appropriate domain by appending zeros to the corresponding dimensions. 
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The marginalization operation is given by 

4>{xi) = I 4>(x u x 2 ) = [hi - K 12 K 22 h 2 ,K u - K 12 K 22 1 K 2 i,g'] 

where g' = g—\ log \K 22 /2T:\ + ^h 2 T (K 22 )~ l h 2 and g is the initial constant term of <j)(xi, x 2 ). 
The conditioning operation is given by 

4>{xi,x 2 = x 2 ) = [hi - K 12 x 2 ,K n ,g'] 

where g' = g + h 2 x 2 — ^x 2 K 22 x 2 . 

B.l The Kalman Filter Recursions 

Suppose we are given the following linear model subject to noise 



Vk 



Az k _i + Ck 
Cz k + e k 



where A and C are constant matrices, Q k ~ A/"(0, Q) and e k ~ A/"(0, R) 
The model encodes the joint distribution 



p{zi:K,yi:K) 

P(zi\z ) 



K 

\\p{yk\zk)p{zk\zk-i) 

k = l 

P{z\) 



(30) 
(31) 



P{z\) 

p{yi\z\) 
p{yi = m\zi) 

p{z 2 \zi) 



[P~ V, P~\ ~\ log \2nP\ - ^ T P"V] 



C T R- X C -C T R- X 
-R~ l C R~ l 



log |2tI\R| 



[0 + (FR^m, C T R- X C, - l - log \2nR\ - ^yfR-'m] 



A T Q~ l A -A T Q- 1 
-Q~ l A Q- 1 



log |2ttQ| 



B.l.l Forward Message Passing 
Suppose we wish to compute the likelihood 



p{y\-.K) = I p{y K \z K )-- - / p{z3\z 2 )p{y 2 \z 2 ) / p{z 2 \zi)p{yi\zi)p{zi) 

7 We can compute this integral by starting from z\ and proceeding to zk- We define forward 

"messages" a as 



7. We let J z = Jdz 
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• «i|o =p{z\) 

• k = l:K 

~ &k\k = P(Vk = yk\zk)a k \k-i 
~ &k+i\k = ! Zk P{zk+i\zk)a-k\k 

The forward recursion is given by 

. ai | = [P- V, P~\ ~\ log \2nP\ - \n T P- l lA 

• k = 1 . . . K 

~ a k\k = [hk\ki K k\ki9k\k] 

hk\k = C T R~ 1 y k + h k \ k _i 

Kk\k = C T R 1 C + ifjfc|fc_i 

9k\k = 9k\k-i ~ \ log |2ttR| - \y[R- x y k 

~ a k+l\k = [hk+l\k-> Kk+l\ki9k+l\k] 

M k = (A T Q- 1 A + K klk )- 1 

hk+i\k = Q~ l AM kh k \ k 

K k+ i\ k = Q- 1 - Q~ 1 AM k A T Q~ 1 

9k+i\k = 9k\k~\ log |2ttQ| + \ log |27rM fc | + \h T k ^M k h k \ k 

B.1.2 Backward Message Passing 

We can compute the likelihood also by starting from y^-. 

p{yi:i<) = \ p{z\)p{yi\zi) I p(z 2 \z 1 )p(y 2 \z 2 ) ■ ■ ■ / p{z k \z K -x)p{vk\zk) 

J Z\ J Z2 J Zk 

In this case the backward propagation can be summarized as 

• Pk\k+i = 1 

• k = K ... 1 

- Pk\k = P(Vk = Vk\zk)Pk\k+i 
~ Pk-i\k = I Zk P{zk\zk-\)Pk\k 

The recursion is given by 

• k = K...l 

- Pk\k = [ h i\ki K k\ki9* k \ k ] 

K\ k = c T R- l y k + K lk+1 

Kt lk = C T R- 1 C + K; lk+1 
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9t\k = ~\ lo § l 27ri? l " \Vk R l Vk + 9l\ k+l 
~ Pk-l\k = Wk-l\ki K k-l\ki9*k-l\k\ 

M* k = (Q-' + KZ lk )-t 
K-i\ k = AT Q~ lM k h t\k 

K* k _ llk = A T Q-HQ-M* k )Q-'A 

9l-i\ k =9l\k-\ log Ml + \ log \2nM*\ + 

B.2 Kalman Smoothing 

Suppose we wish to find the distribution of a particular z k given all the observations yi.K- 
We just have to combine forward and backward messages as 

P(zk\yi:K) OC p{yk+l:K,Zk,yi:k) 

= p(Vl:k, Zk)p(yk + l:K\zk) 

= a k \ k X Pk\k+l 

= [hk\k + K\k+v K k\k + K k\k+v9k\k + 9k\k+i\ 

Appendix C. Rao-Blackwellized SMC for the Switching State space 
Model 

We let i = 1 ... N be an index over particles and s = 1 . . . S an index over states of 7. We 
denote the (unnormalized) filtering distribution at time k — 1 by 

4>k-l = P(l/0:Jfc-l,«Jfc-l|7i!fc-l) 

Since yo:fc-i are observed, 4*k-i * s a Gaussian potential on Zk-i with parameters Z k ^ x 
•^"{f^k-v ^i-i )■ Note that the normalization constant Z k \ x is the data likelihoodp(yo:fc-i|7i^_i) = 
/ dz k <^ k _ v Similarly, we denote the filtered distribution at the next slice conditioned on 

Ik = s by 

<t> k ^ = Jdz k - 1 p(y k \z k )p(z k \z k _ 1 ,j k = s)(f) ( k ) _ 1 (32) 
= p{yo:k,Zk\'Yil-i,'Yk = s) 

We denote the normalization constant of by Z^ l \ Hence the joint proposal on s and 
(i) is given by 

1k V) = J ' d Zk<t>k M x P(lk = s,7i!fc_i) 

= P{lk = S,7i!l-l'f0:fc) 

The outline of the algorithm is given below: 
• Initialize. For % = 1 ... TV, <— p(yo, ^0) 
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• For k = 1 ... K 

- For i = 1 . . . N, s = 1 . . . S 

Compute 0£ S '^ from ^k-i usm g Eq.32. 

- For i = 1 . . . N 

Select a tuple (s\j) ~ 
Til ^~ (Tifi-i)7A; = s ) 

Note that the procedure has a "built-in" resampling schema for eliminating particles 
with small importance weight. Sampling jointly on (s\i) is equivalent to sampling a single 
s for each i and then resampling i according to the weights w£ . One can also check that, 
since we are using the optimal proposal distribution of Eq.27, the weight at each step is 
given by = p{-y[ l ) k _ v yo-.k)- 
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